OKCupid Supervised Machine Learning Revisited

Project Author: Alexander Lacson

This is the continuation of a project where I analyze data from OKCupid. The first part is here. I felt like the section on predictive models from the previous work needed improvement. This notebook is a revision and expansion of that section.

Block Diagram of Model Tuning and Evaluation

The diagram below comes from the scikit-learn docs. This is the procedure that we will follow. The left section outlines the model parameter tuning process; the right section outlines the model testing and evaluation process.

Load data

Separate Predictors and Labels

Import Libraries

Model Parameter Tuning

Some explanation is needed before you dive in below. GridSearchCV repeatedly builds a model using every combination of a specified set of parameter values. In addition, GridSearchCV uses StratifiedKFold cross-validation to report a model score. Which evaluation metric to use as the score is your decision, but accuracy is used by default. If you do not pass an argument to the cv parameter of GridSearchCV, it uses k=5, i.e. 5-fold cross-validation, by default. However, if you want to specify the fold parameters, or choose a different validator such as RepeatedStratifiedKFold, you pass a validator object, with its parameters set, as the cv argument. By not passing an argument below I have used the default cv.
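To make this concrete, here is a minimal sketch of both usages. The data is synthetic (generated with make_classification) and the parameter values are illustrative stand-ins, not the notebook's actual grid.

```python
# Sketch: GridSearchCV with the default cv vs. an explicit validator.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Synthetic stand-in for the OKCupid predictors and labels.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Default behavior: cv=None means 5-fold (stratified for classifiers),
# and scoring=None means the estimator's default score (accuracy here).
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1, 10]})
search.fit(X, y)
print(search.best_params_, search.best_score_)

# To customize the validation, pass a validator object via cv:
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
search_rskf = GridSearchCV(LogisticRegression(max_iter=1000),
                           param_grid={"C": [0.01, 0.1, 1, 10]},
                           cv=rskf, scoring="accuracy")
search_rskf.fit(X, y)
```

After fitting, `best_params_` holds the winning combination and `best_score_` its mean cross-validated score.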

It's a good idea to revisit our project objective. Our original goal was "Use Machine Learning to predict gender". In this case we don't want the model to discriminate by gender: as much as possible, it should perform equally well for both classes. This is why the metric we will use as the model score for GridSearchCV is accuracy, its default.

Logistic Regression Model

Tuning (Parameter = C)

C represents the inverse of regularization strength: smaller values of C mean stronger regularization.
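The inverse relationship can be seen directly in the learned coefficients. This sketch uses synthetic data and two extreme C values chosen purely for illustration:

```python
# Sketch: smaller C = stronger L2 regularization = smaller coefficients.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, random_state=1)

strong = LogisticRegression(C=0.001, max_iter=1000).fit(X, y)  # heavy regularization
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)    # light regularization

# Total coefficient magnitude shrinks as C decreases.
print(np.abs(strong.coef_).sum(), np.abs(weak.coef_).sum())
```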

Results

Decision Tree Model

Tuning (Parameter = max_depth)

Max Tree Depth

Results

Random Forest Model

Tuning (Parameters = max_depth, n_estimators)

Max Tree Depth and Number of Trees in the Forest
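Tuning two parameters at once just means passing both to the grid; GridSearchCV then tries every combination. The grid values and synthetic data below are stand-ins for the notebook's actual setup:

```python
# Sketch: a two-parameter grid search over a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"max_depth": [3, 7, 11, 15],
              "n_estimators": [50, 100, 200]}

# Every (max_depth, n_estimators) pair is cross-validated.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid=param_grid, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```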

Results

The 3D Scatter plot can be rotated, panned, and zoomed.

Tuning Duration

Interesting Observation

Note that with the Random Forest we still gain performance at higher max depths, unlike the Decision Tree, whose performance drops off sharply after peaking at a depth of 7.

Dataset splitting

Let's set aside our test and train sets.
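A minimal sketch of a stratified split; the test fraction and the synthetic data are assumptions, not the notebook's exact values:

```python
# Sketch: hold out a stratified test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# stratify=y preserves the male/female class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(X_train.shape, X_test.shape)
```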

Retraining Models using Optimized Parameters
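The retraining step amounts to constructing each model with the parameters the grid search selected and fitting it on the training set only. The `best_params` value below is a stand-in for what `search.best_params_` would return:

```python
# Sketch: refit a model using tuned parameters on the training split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

best_params = {"max_depth": 7}  # e.g. taken from search.best_params_
model = DecisionTreeClassifier(random_state=0, **best_params)
model.fit(X_train, y_train)

# Held-out accuracy on the untouched test set.
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```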

Model Evaluation and Comparison

The plot can be zoomed in on the rightmost region to further highlight the differences.

In all cases the models classify males better than females. In the best case, the logistic regression model, men are classified 6.77% more accurately than women. This is a consequence of training on a male-skewed dataset, of not having enough reliable features that allow the model to confidently classify women, or both.
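Per-class performance gaps like this show up directly if you compute recall for each class separately rather than a single overall accuracy. A sketch on synthetic stand-in data:

```python
# Sketch: compare how well each class is identified using per-class recall.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Mildly imbalanced synthetic stand-in for a skewed dataset.
X, y = make_classification(n_samples=400, n_features=6,
                           weights=[0.6, 0.4], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)

# average=None returns recall for each class separately.
per_class_recall = recall_score(y_test, pred, average=None)
print(per_class_recall)
```

`classification_report` gives the same breakdown (plus precision and F1) in one call.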

This is not without consequence in the real world. I recommend the Netflix documentary "Coded Bias". In the documentary, during a congressional hearing, Alexandria Ocasio-Cortez questions Joy Buolamwini. Here is a selected excerpt:

AOC: "What demographic is it [AI models] mostly effective on?"
JB: "White Men"
AOC: "And who are the primary engineers and designers of these algorithms?"
JB: "Definitely white men"

Despite the exchange above, it is possible to build a biased AI model without being a "white man". You could simply be an ML engineer who failed to properly evaluate your model's performance at identifying every class label. This issue has already entered the mainstream social and political spheres.

Examination of Predictor Weights and Importances

For all models, height and body_type_curvy are our top predictors, probably because men are taller than women on average, and because men, unlike women, are unlikely to describe themselves as curvy. Interestingly, beyond the top two predictors, the random forest orders its feature importances differently from the other models.

With Logistic Regression we can conveniently see which features were more useful for predicting each class (male or female), because its weight coefficients are signed, unlike the feature importances of the Random Forest and Decision Tree. We can see that the model has more confidence in its male predictors than in its female predictors.
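The distinction is visible in the attributes themselves: logistic regression exposes signed `coef_` values, while tree ensembles expose non-negative `feature_importances_` that sum to 1. A sketch on synthetic data:

```python
# Sketch: signed coefficients vs. non-negative feature importances.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X, y)
forest = RandomForestClassifier(random_state=0).fit(X, y)

coefs = logreg.coef_[0]                    # signed: direction indicates class
importances = forest.feature_importances_  # non-negative, normalized to sum to 1
print(coefs)
print(importances)
```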

Next Steps

Although not presented here, the age distribution is skewed towards the young, so the model is most likely more effective at classifying young people than old people. One way to alleviate this is to apply a power transform to the age feature before training, which makes its distribution more nearly normal (the same fitted transform must also be applied to the test set before predicting). We would then evaluate the model's performance on young vs. old subsets of our data.
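This next step could be sketched with scikit-learn's PowerTransformer; the right-skewed "age" column below is simulated, and the key point is that the transform is fitted on the training data only and then reused on the test data:

```python
# Sketch: fit a power transform on training ages, reuse it on test ages.
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
age_train = rng.exponential(scale=10, size=(500, 1)) + 18  # right-skewed ages
age_test = rng.exponential(scale=10, size=(100, 1)) + 18

pt = PowerTransformer()                     # Yeo-Johnson by default
age_train_t = pt.fit_transform(age_train)   # fit on training data only
age_test_t = pt.transform(age_test)         # apply the same fitted transform

# With standardize=True (the default), the transformed training column
# has zero mean and unit variance.
print(age_train_t.mean(), age_train_t.std())
```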

Model evaluation can still be taken steps further. The scikit-learn documentation includes an example of applying frequentist and Bayesian statistical approaches to make model comparisons more definitively.

Now that you know how to evaluate a model, the challenge is to learn how to improve model performance. Not just in general, but to be fair at identifying all classes. You will rarely ever have a dataset that is perfectly balanced.